Abstract

Recommender systems have become increasingly popular and are now
widely used in many applications. The growing amount of available
information, in both quantity and variety, poses a major challenge:
how can a recommender system leverage this rich information to
achieve better performance? Most traditional approaches design a
specific model for each scenario, which demands great effort in
developing and modifying models. In this technical report, we
describe our implementation of feature-based matrix factorization.
This model is an abstraction of many variants of matrix factorization
models, and new types of information can be utilized by simply
defining new features, without modifying any line of code. Using the
toolkit, we built the best single model reported on track 1 of
KDDCup'11.

1 Introduction



Recommender systems that recommend items based on users' interests
have become more and more popular among web sites. Collaborative
Filtering (CF) techniques, which lie behind recommender systems, have
been developed for many years and remain a hot area in both academia
and industry. Currently, CF problems face two major challenges: how
to handle large-scale datasets and how to leverage the rich
information in the collected data.

The traditional approach to these problems is to design a specific
model for each problem, i.e., to write code for each model, which
demands great engineering effort. Matrix factorization (MF) is one of
the most popular CF methods, and different variants of matrix
factorization models have been studied extensively, such as [3], [4]
and [5]. However, we find that the majority of matrix factorization
models share common patterns, which motivates us to put them together
into one model. We call this model feature-based matrix
factorization. Moreover, we wrote a toolkit for solving the general
feature-based matrix factorization problem, which saves the
engineering effort of implementing each specific model. Using the
toolkit, we obtained the best single model on track 1 of
KDDCup'11 [2].

This article serves as a technical report for our toolkit of
feature-based matrix factorization¹. We try to address three
questions in this report: what the model is, how such a model can be
used, and what issues arise in engineering and efficient computation.

2 What is feature-based MF

In this section, we describe the feature-based matrix factorization
model, starting from the example of linear regression and then moving
to the full definition of our model.

2.1 Start from linear regression 

Let's start from the basic collaborative filtering models. The
simplest are the baseline models that only consider the mean effects
of users and items. See the following two models.

$$\hat{r}_{ui} = \mu + b_u \tag{1}$$

$$\hat{r}_{ui} = \mu + b_u + b_i \tag{2}$$

Here $\mu$ is a constant indicating the global mean rating. Equation 1
describes a model that considers the users' mean effect, while
Equation 2 additionally includes the items' mean effect. A more
complex model that considers neighborhood information [3] is as
follows

$$\hat{r}_{ui} = \mu + b_i + b_u + |R(u)|^{-1/2} \sum_{j \in R(u)} s_{ij} (r_{uj} - \bar{b}_u) \tag{3}$$

Here $R(u)$ is the set of items rated by user $u$, and $\bar{b}_u$ is
the pre-calculated average rating of user $u$. $s_{ij}$ is the
similarity parameter from $i$ to $j$; it is a parameter that we train
from data instead of calculating it directly with memory-based
methods. Note that $\bar{b}_u$ is different from $b_u$ since it is
pre-calculated. This is a neighborhood model that takes the
neighborhood effect of items into consideration.

¹ http://apex.sjtu.edu.cn/apex_wiki/svdfeature

Suppose we want to implement all three models. It seems wasteful to
write code for each of them. If we compare the models, it is clear
that all three are special cases of the linear regression problem
described by Equation 4

$$y = \sum_i w_i x_i \tag{4}$$

Suppose we have $n$ users, $m$ items, and $h$ possible $s_{ij}$ in
Equation 3. We can define the feature vector
$x = [x_0, x_1, \cdots, x_{n+m+h-1}]$ for the user item pair
$<u, i>$ as follows



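$$x_k = \begin{cases}
\mathrm{Indicator}(u = k) & k < n \\
\mathrm{Indicator}(i = k - n) & n \le k < n + m \\
0 & k \ge m + n,\ j \notin R(u),\ w_k \text{ corresponds to } s_{ij} \\
|R(u)|^{-1/2} (r_{uj} - \bar{b}_u) & k \ge m + n,\ j \in R(u),\ w_k \text{ corresponds to } s_{ij}
\end{cases} \tag{5}$$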

The corresponding layout of the weight $w$ is shown in Equation 6.
Note that the choice of pairs can be flexible: we can choose only the
possible neighbors instead of enumerating all pairs.

$$w = [b_u(0), b_u(1), \cdots, b_u(n-1), b_i(0), \cdots, b_i(m-1), \cdots, s_{ij}, \cdots] \tag{6}$$

In other words, Equation 3 can be rewritten in the following form

$$\hat{r}_{ui} = \mu + b_i \cdot 1 + b_u \cdot 1 + \sum_{j \in R(u)} s_{ij} \left[ |R(u)|^{-1/2} (r_{uj} - \bar{b}_u) \right] \tag{7}$$

where $b_i$, $b_u$, $s_{ij}$ correspond to the weights of the linear
regression, and the coefficients to the right of the weights are the
input features. In summary, under this framework, the only thing we
need to do is lay out the parameters into a feature vector. In our
case, we assign the first $n$ features to $b_u$, then $b_i$ and
$s_{ij}$, and then transform the input data into the format of linear
regression input. Finally, we use a linear regression solver to work
the problem out.
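The feature layout of this section can be sketched in a few lines of Python. This is only an illustration, not the toolkit's code: the function names, the sparse `{index: value}` representation and the `pair_index` lookup that assigns an offset to each modelled pair $(i, j)$ are all assumptions made for the example.

```python
def baseline_features(u, i, n, m, rated, user_mean, pair_index):
    """Sparse feature vector {k: x_k} of Equation (5) for the pair (u, i).

    rated:      {item j: rating r_uj} for the items in R(u)
    user_mean:  pre-calculated average rating of user u (held fixed)
    pair_index: hypothetical map {(i, j): offset} of the modelled s_ij
    """
    x = {u: 1.0, n + i: 1.0}                  # user and item indicators
    norm = len(rated) ** -0.5 if rated else 0.0
    for j, r_uj in rated.items():
        if (i, j) in pair_index:              # only the pairs we chose to model
            x[n + m + pair_index[(i, j)]] = norm * (r_uj - user_mean)
    return x


def predict(mu, w, x):
    """Global mean plus the linear regression term of Equation (4)."""
    return mu + sum(w[k] * x_k for k, x_k in x.items())


# Toy setup: 2 users, 3 items, and two modelled pairs s_01 and s_02.
n, m = 2, 3
pair_index = {(0, 1): 0, (0, 2): 1}
w = [0.1, -0.2, 0.3, 0.0, 0.0, 0.5, -0.5]     # [b_u | b_i | s_ij]
x = baseline_features(1, 0, n, m, {1: 4.0, 2: 2.0}, 3.0, pair_index)
y = predict(3.0, w, x)
```

Because only the nonzero entries are stored, the three baseline models differ only in which features are emitted; the solver itself stays unchanged.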



2.2 Feature based matrix factorization 

The previous section shows that some baseline CF algorithms are
linear regression problems. In this section, we discuss the
feature-based generalization of matrix factorization. A basic matrix
factorization model is stated in Equation 8

$$\hat{r}_{ui} = \mu + b_u + b_i + p_u^T q_i \tag{8}$$
Figure 1: Feature-based matrix factorization



The bias terms have the same meaning as in the previous section. We
also get two factor terms $p_u$ and $q_i$: $p_u$ models the latent
preference of user $u$, and $q_i$ models the latent property of item
$i$.

Inspired by the idea of the previous section, we can get a direct
generalization of matrix factorization.

$$\hat{r}_{ui} = \mu + \sum_j w_j x_j + b_u + b_i + p_u^T q_i \tag{9}$$



Equation 9 adds a linear regression term to the traditional matrix
factorization model. This allows us to add more bias information,
such as neighborhood information and time bias information. However,
we may also need a more flexible factor part. For example, we may
want a time-dependent user factor $p_u(t)$ or a hierarchy-dependent
item factor $q_i(h)$. As the previous section suggests, a direct way
to include such flexibility is to use features in the factor part as
well. So we adjust our feature-based matrix factorization as follows

$$y = \mu + \left( \sum_j b^{(g)}_j \gamma_j + \sum_j b^{(u)}_j \alpha_j + \sum_j b^{(i)}_j \beta_j \right) + \left( \sum_j p_j \alpha_j \right)^T \left( \sum_j q_j \beta_j \right) \tag{10}$$

The input consists of three kinds of features
$<\alpha, \beta, \gamma>$: we call $\alpha$ the user feature, $\beta$
the item feature and $\gamma$ the global feature. The first part of Equation
The names of these features explain their meanings: $\alpha$
describes the user aspects, $\beta$ describes the item aspects, while
$\gamma$ describes some global bias effect. Figure 1 shows the idea
of the procedure.

Basic matrix factorization is a special case of Equation 10. For
predicting the user item pair $<u, i>$, define

$$\gamma = \emptyset, \quad \alpha_k = \begin{cases} 1 & k = u \\ 0 & k \ne u \end{cases}, \quad \beta_k = \begin{cases} 1 & k = i \\ 0 & k \ne i \end{cases} \tag{11}$$



We are not limited to simple matrix factorization. The model enables
us to incorporate neighborhood information through $\gamma$, and a
time-dependent user factor by modifying $\alpha$. Section 3 will
present a detailed description of this.
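As a rough illustration of Equation 10, the prediction can be computed from three sparse feature maps. The dictionary-based `model` layout and the function names below are assumptions made for this sketch and do not reflect the toolkit's actual data structures.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(vectors, feats, dim):
    """Return sum_j feats[j] * vectors[j] as a dense list of length dim."""
    out = [0.0] * dim
    for j, v in feats.items():
        for f in range(dim):
            out[f] += v * vectors[j][f]
    return out


def predict(model, alpha, beta, gamma):
    """Equation (10) for sparse features given as {index: value} dicts."""
    bias = (sum(model["bg"][j] * v for j, v in gamma.items())
            + sum(model["bu"][j] * v for j, v in alpha.items())
            + sum(model["bi"][j] * v for j, v in beta.items()))
    p = weighted_sum(model["p"], alpha, model["dim"])
    q = weighted_sum(model["q"], beta, model["dim"])
    return model["mu"] + bias + dot(p, q)


model = {
    "mu": 3.0, "dim": 2,
    "bg": [],                                  # no global features here
    "bu": [0.1, -0.1],                         # two users
    "bi": [0.2, 0.0, -0.3],                    # three items
    "p": [[1.0, 0.0], [0.5, 0.5]],
    "q": [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]],
}
# Basic matrix factorization: indicator features for u = 1 and i = 2.
y = predict(model, alpha={1: 1.0}, beta={2: 1.0}, gamma={})
```

With indicator features the call reduces to $\mu + b_u + b_i + p_u^T q_i$; richer models only change the contents of the three feature maps.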



2.3 Activation function and loss function

You also need to choose an activation function $f(\cdot)$ to apply to
the output of the feature-based matrix factorization. Similarly, you
can try various loss functions for loss estimation. The final version
of the model is

$$\hat{r} = f(y) \tag{12}$$

$$Loss = L(\hat{r}, r) + \text{regularization} \tag{13}$$

Common choices of activation and loss functions are listed as follows:

• identity function, L2 loss, original matrix factorization.

$$\hat{r} = f(y) = y \tag{14}$$

$$Loss = (r - \hat{r})^2 + \text{regularization} \tag{15}$$

• sigmoid function, log likelihood, logistic regression version of
matrix factorization.

$$\hat{r} = f(y) = \frac{1}{1 + e^{-y}} \tag{16}$$

$$Loss = -\left[ r \ln \hat{r} + (1 - r) \ln(1 - \hat{r}) \right] + \text{regularization} \tag{17}$$

• identity function, smoothed hinge loss [7], maximum margin matrix
factorization [8] [7]. Binary classification problem, $r \in \{0, 1\}$.

$$Loss = h((2r - 1) y) + \text{regularization} \tag{18}$$

$$h(z) = \begin{cases}
\frac{1}{2} - z & z \le 0 \\
\frac{1}{2} (1 - z)^2 & 0 < z < 1 \\
0 & z \ge 1
\end{cases} \tag{19}$$
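The three combinations listed above can be written down directly. The following is a minimal sketch with hypothetical function names; the log-likelihood term is negated here so that, like the other two losses, it is minimized, and regularization is left out.

```python
from math import exp, log


def identity(y):
    return y


def sigmoid(y):
    return 1.0 / (1.0 + exp(-y))


def squared_loss(r, y):
    """L2 loss with the identity activation."""
    return (r - identity(y)) ** 2


def log_loss(r, y):
    """Negative log-likelihood with the sigmoid activation, r in {0, 1}."""
    p = sigmoid(y)
    return -(r * log(p) + (1 - r) * log(1 - p))


def smooth_hinge(z):
    """Smoothed hinge h(z): linear, then quadratic, then zero."""
    if z <= 0:
        return 0.5 - z
    if z < 1:
        return 0.5 * (1 - z) ** 2
    return 0.0


def hinge_loss(r, y):
    """Smoothed hinge loss for r in {0, 1}, mapped to a label in {-1, +1}."""
    return smooth_hinge((2 * r - 1) * y)
```

For example, `hinge_loss(1, 0.5)` evaluates the quadratic branch and returns 0.125, while any prediction with margin at least 1 on the correct side costs nothing.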
2.4 Model Learning

To update the model, we use the following update rules

$$p_i = p_i + \eta \left( e \alpha_i \sum_j \beta_j q_j - \lambda_1 p_i \right) \tag{20}$$

$$q_i = q_i + \eta \left( e \beta_i \sum_j \alpha_j p_j - \lambda_2 q_i \right) \tag{21}$$

$$b^{(g)}_i = b^{(g)}_i + \eta \left( e \gamma_i - \lambda_3 b^{(g)}_i \right) \tag{22}$$

$$b^{(u)}_i = b^{(u)}_i + \eta \left( e \alpha_i - \lambda_4 b^{(u)}_i \right) \tag{23}$$

$$b^{(i)}_i = b^{(i)}_i + \eta \left( e \beta_i - \lambda_5 b^{(i)}_i \right) \tag{24}$$

Here $e = r - \hat{r}$ is the difference between the true rating and
the predicted rating. This rule is valid for both the logistic
likelihood loss and the L2 loss. For other losses, we modify $e$ to
be the corresponding gradient. $\eta$ is the learning rate and the
$\lambda$s are regularization parameters that define the strength of
regularization.
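One stochastic gradient step following these update rules might look as follows. This is a sketch under stated assumptions: the identity activation is used so that $e = r - y$, the `model` and `lam` dictionaries are hypothetical containers, and regularization is applied only to parameters whose feature is nonzero in the current sample.

```python
def sgd_step(model, alpha, beta, gamma, r, eta, lam):
    """One update of Equations (20)-(24); returns the error e before it."""
    dim = model["dim"]
    # Feature-weighted factor sums, computed once from the old parameters.
    p_sum = [sum(v * model["p"][j][f] for j, v in alpha.items())
             for f in range(dim)]
    q_sum = [sum(v * model["q"][j][f] for j, v in beta.items())
             for f in range(dim)]
    y = (model["mu"]
         + sum(model["bg"][j] * v for j, v in gamma.items())
         + sum(model["bu"][j] * v for j, v in alpha.items())
         + sum(model["bi"][j] * v for j, v in beta.items())
         + sum(a * b for a, b in zip(p_sum, q_sum)))
    e = r - y
    for j, v in alpha.items():
        for f in range(dim):
            model["p"][j][f] += eta * (e * v * q_sum[f]
                                       - lam["p"] * model["p"][j][f])
        model["bu"][j] += eta * (e * v - lam["bu"] * model["bu"][j])
    for j, v in beta.items():
        for f in range(dim):
            model["q"][j][f] += eta * (e * v * p_sum[f]
                                       - lam["q"] * model["q"][j][f])
        model["bi"][j] += eta * (e * v - lam["bi"] * model["bi"][j])
    for j, v in gamma.items():
        model["bg"][j] += eta * (e * v - lam["bg"] * model["bg"][j])
    return e


model = {"mu": 0.0, "dim": 1, "bg": [], "bu": [0.0], "bi": [0.0],
         "p": [[0.1]], "q": [[0.1]]}
lam = {"p": 0.0, "q": 0.0, "bg": 0.0, "bu": 0.0, "bi": 0.0}
e1 = sgd_step(model, {0: 1.0}, {0: 1.0}, {}, r=1.0, eta=0.1, lam=lam)
e2 = sgd_step(model, {0: 1.0}, {0: 1.0}, {}, r=1.0, eta=0.1, lam=lam)
```

Repeating the step on the same sample shrinks the error, as expected for a small learning rate. The cost of a step is proportional to the number of nonzero features times the factor dimension.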

3 What information can be included 

In this section, we will present some examples to illustrate the usage 
of our feature-based matrix factorization model. 

3.1 Basic matrix factorization 

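The basic matrix factorization model is defined by the following
equation

$$y = \mu + b_u + b_i + p_u^T q_i \tag{25}$$

and the corresponding feature representation is

$$\gamma = \emptyset, \quad \alpha_k = \begin{cases} 1 & k = u \\ 0 & k \ne u \end{cases}, \quad \beta_k = \begin{cases} 1 & k = i \\ 0 & k \ne i \end{cases} \tag{26}$$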

3.2 Pairwise rank model 

For the ranking model, we are interested in the order of two items
$i$, $j$ given a user $u$. A pairwise ranking model is described as
follows

$$P(r_{ui} > r_{uj}) = \mathrm{sigmoid}\left( \mu + b_i - b_j + p_u^T (q_i - q_j) \right) \tag{27}$$

The corresponding feature representation is

$$\gamma = \emptyset, \quad \alpha_k = \begin{cases} 1 & k = u \\ 0 & k \ne u \end{cases}, \quad \beta_k = \begin{cases} 1 & k = i \\ -1 & k = j \\ 0 & k \ne i,\ k \ne j \end{cases} \tag{28}$$

using the sigmoid as activation function and the log-likelihood as
loss function. Note that the feature representation gives one extra
$b_u$, which is not desirable. We can remove it by giving a high
regularization to $b_u$ that penalizes it to 0.
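A small sketch of this feature representation is given below. The function names are hypothetical, and the user bias is simply left out, which corresponds to the case where it has been regularized to 0.

```python
from math import exp


def pairwise_features(u, i, j):
    """Features for 'user u prefers item i over item j'."""
    return {u: 1.0}, {i: 1.0, j: -1.0}, {}     # alpha, beta, gamma


def preference_probability(mu, b_item, p, q, alpha, beta):
    """Sigmoid of the feature-based score, with the user bias omitted."""
    dim = len(p[0])
    p_sum = [sum(v * p[k][f] for k, v in alpha.items()) for f in range(dim)]
    q_sum = [sum(v * q[k][f] for k, v in beta.items()) for f in range(dim)]
    score = (mu + sum(v * b_item[k] for k, v in beta.items())
             + sum(a * b for a, b in zip(p_sum, q_sum)))
    return 1.0 / (1.0 + exp(-score))


b_item = [0.5, 0.0, 0.2]
p = [[1.0, 1.0]]
q = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
alpha, beta, _ = pairwise_features(u=0, i=2, j=0)
prob_ij = preference_probability(0.0, b_item, p, q, alpha, beta)
alpha, beta, _ = pairwise_features(u=0, i=0, j=2)
prob_ji = preference_probability(0.0, b_item, p, q, alpha, beta)
```

The item feature with values $+1$ and $-1$ yields both $b_i - b_j$ and $q_i - q_j$, so no ranking-specific code is needed. With $\mu = 0$, swapping the two items gives the complementary probability.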



3.3 Temporal Information 

A model that includes temporal information can be described as
follows

$$y = \mu + b_u(t) + b_i(t) + b_u + b_i + (p_u + p_u(t))^T q_i \tag{29}$$

We can include $b_i(t)$ using the global feature, and $b_u(t)$,
$p_u(t)$ using the user feature. For example, we can define a time
interpolation model as follows

$$y = \mu + b_i + b_u^s \frac{e - t}{e - s} + b_u^e \frac{t - s}{e - s} + \left( p_u^s \frac{e - t}{e - s} + p_u^e \frac{t - s}{e - s} \right)^T q_i \tag{30}$$

Here $s$ and $e$ are the start and end times of all the ratings. A
rating given later is affected more by $p^e$ and $b^e$, and an
earlier rating is affected more by $p^s$ and $b^s$. For this model,
we can define

$$\gamma = \emptyset, \quad \alpha_k = \begin{cases} \frac{e - t}{e - s} & k = u \\ \frac{t - s}{e - s} & k = u + n \\ 0 & \text{otherwise} \end{cases}, \quad \beta_k = \begin{cases} 1 & k = i \\ 0 & k \ne i \end{cases} \tag{31}$$

Note that we first arrange the $p^s$ in the first $n$ features and
then the $p^e$ in the next $n$ features.
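The time interpolation features can be generated as in the sketch below; the function names and the toy factor table are assumptions for illustration only.

```python
def time_interpolation_features(u, t, s, e, n):
    """User feature for a rating made at time t.

    s and e are the earliest and latest rating times, n the number of users.
    Index u weights the start parameters, index u + n the end parameters.
    """
    span = float(e - s)
    return {u: (e - t) / span, u + n: (t - s) / span}


def effective_user_factor(p, alpha):
    """Feature-weighted sum of the user factors selected by alpha."""
    dim = len(p[0])
    return [sum(v * p[k][f] for k, v in alpha.items()) for f in range(dim)]


n = 2
# Rows 0..n-1 hold the start factors, rows n..2n-1 the end factors.
p = [[0.0, 0.0], [1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
alpha = time_interpolation_features(u=1, t=25, s=0, e=100, n=n)
p_t = effective_user_factor(p, alpha)
```

A rating made a quarter of the way through the time span puts weight 0.75 on the start factor and 0.25 on the end factor, so the effective user factor moves linearly from one to the other over time.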



3.4 Neighborhood information 

A model that includes neighborhood information [3] can be described
as below:

$$y = \mu + \sum_{j \in R(u)} s_{ij} |R(u)|^{-1/2} (r_{uj} - \bar{b}_u) + b_u + b_i + p_u^T q_i \tag{32}$$

We only need to implement the neighborhood information as global
features, as described in Section 2.1.
3.5 Hierarchical information



In the Yahoo! Music Dataset [2], some tracks belong to the same
artist. We can include such hierarchical information by adding it to
the item feature. The model is described as follows

$$y = \mu + b_u + b_t + b_a + p_u^T (q_t + q_a) \tag{33}$$

Here $t$ denotes the track and $a$ denotes the corresponding artist.
This model can be formalized as feature-based matrix factorization by
redefining the item feature.
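One possible way to redefine the item feature is sketched here. The `artist_of` lookup and the convention of storing artist parameters after the track parameters are assumptions of this example, not something prescribed by the dataset or the toolkit.

```python
def track_item_features(track, artist_of, n_tracks):
    """Item feature that activates the track and, if known, its artist.

    artist_of is a hypothetical {track: artist} lookup; artist parameters
    are stored after the n_tracks track parameters.
    """
    beta = {track: 1.0}
    artist = artist_of.get(track)
    if artist is not None:                 # some tracks may lack artist data
        beta[n_tracks + artist] = 1.0
    return beta


artist_of = {0: 1, 1: 1}                   # tracks 0 and 1 share artist 1
q = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0],   # track factors (3 tracks)
     [0.5, 0.5], [0.2, 0.2]]               # artist factors (2 artists)
beta = track_item_features(0, artist_of, n_tracks=3)
item_factor = [sum(v * q[k][f] for k, v in beta.items()) for f in range(2)]
```

Tracks of the same artist share the artist's bias and factor, which lets rarely rated tracks borrow strength from the artist-level parameters.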



4 Efficient training for SVD++

Feature-based matrix factorization can naturally incorporate implicit
and explicit information. We can simply add this information to the
user feature $\alpha$. The model configuration is shown as follows:

$$y = bias + \left( \sum_j \xi_j p_j + \sum_j \alpha_j d_j \right)^T \left( \sum_j \beta_j q_j \right) \tag{34}$$

Here we omit the details of the bias term. The implicit and explicit
feedback information is given by $\sum_j \alpha_j d_j$, where
$\alpha$ is the feature vector of feedback information:
$\alpha_j = \frac{1}{\sqrt{|R(u)|}}$ for implicit feedback, and
$\alpha_j = \frac{r_{uj} - \bar{b}_u}{\sqrt{|R(u)|}}$ for explicit
feedback. $d_j$ is the parameter of the implicit and explicit
feedback factor. We state the implicit and explicit information
explicitly in Equation 34.



Although Equation 34 shows that we can easily incorporate implicit
and explicit information into the model, it is actually very costly
to run stochastic gradient training, since the update cost is linear
in the number of nonzero entries of $\alpha$, and $\alpha$ can be
very large if a user has rated many items. This greatly slows down
training, so we need an optimized training method. To show the idea
of the optimized method, let's first define a derived user implicit
and explicit factor $p^{im}$ as follows:

$$p^{im} = \sum_j \alpha_j d_j \tag{35}$$

The update of $d_j$ after one step is given by the following equation

$$\Delta d_j = \eta e \alpha_j \left( \sum_k \beta_k q_k \right) \tag{36}$$
The resulting difference in $p^{im}$ is given by

$$\Delta p^{im} = \eta e \left( \sum_j \alpha_j^2 \right) \left( \sum_k \beta_k q_k \right) \tag{37}$$

Given a group of samples with the same user, we need to do gradient
descent on each training sample. The simplest way is to do the
following steps for each sample: (1) calculate $p^{im}$ to get the
prediction; (2) update all $d_j$ associated with implicit and
explicit feedback. In this way, $p^{im}$ has to be recalculated from
the updated $d_j$ every time. However, to get the new $p^{im}$, we
don't need to update each $d_j$. Instead, we only need to update
$p^{im}$ using Equation 37. What's more, there is a relation between
$\Delta p^{im}$ and $\Delta d_j$ as follows:

$$\Delta d_j = \frac{\alpha_j}{\sum_k \alpha_k^2} \Delta p^{im} \tag{38}$$

We shall emphasize that Equation 38 holds even for multiple updates,
provided that the user is the same in all the samples. The above
analysis doesn't consider the regularization term. If L2
regularization of $d_j$ is used during the update as follows:

$$\Delta d_j = \eta \left( e \alpha_j \sum_k \beta_k q_k - \lambda d_j \right) \tag{39}$$

the corresponding change in $p^{im}$ looks very similar

$$\Delta p^{im} = \eta \left( e \left( \sum_j \alpha_j^2 \right) \left( \sum_k \beta_k q_k \right) - \lambda p^{im} \right) \tag{40}$$

However, the relation in Equation 38 no longer holds strictly. We can
still use it, since it holds approximately when the regularization
term is small. Using these results, we can develop a fast algorithm
for feature-based matrix factorization with implicit and explicit
feedback information. The algorithm is shown in Algorithm 1.

The basic idea is to group the data of the same user together, since
the same user shares the same implicit and explicit feedback
information. Algorithm 1 allows us to calculate the implicit feedback
factor only once per user, greatly saving computation time.
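The grouping idea of Algorithm 1 can be checked numerically on a toy problem. The sketch below is an illustration under simplifying assumptions: biases and regularization are dropped, the model is reduced to $y = (p_u + p^{im})^T q_i$, and all function and variable names are hypothetical. It compares a naive per-sample update of every $d_j$ with the grouped update.

```python
import copy


def train_user_naive(p_u, q, d, alpha, samples, eta):
    """Update every d_j for every sample of one user."""
    dim = len(p_u)
    for i, r in samples:
        q_i = q[i]
        p_im = [sum(a * d[j][f] for j, a in alpha.items()) for f in range(dim)]
        user = [p_u[f] + p_im[f] for f in range(dim)]
        e = r - sum(user[f] * q_i[f] for f in range(dim))
        for f in range(dim):
            q_old = q_i[f]
            q_i[f] += eta * e * user[f]
            p_u[f] += eta * e * q_old
            for j, a in alpha.items():
                d[j][f] += eta * e * a * q_old


def train_user_grouped(p_u, q, d, alpha, samples, eta):
    """Update p_im per sample and write the change back to d once."""
    dim = len(p_u)
    p_im = [sum(a * d[j][f] for j, a in alpha.items()) for f in range(dim)]
    p_old = list(p_im)
    norm = sum(a * a for a in alpha.values())
    for i, r in samples:
        q_i = q[i]
        user = [p_u[f] + p_im[f] for f in range(dim)]
        e = r - sum(user[f] * q_i[f] for f in range(dim))
        for f in range(dim):
            q_old = q_i[f]
            q_i[f] += eta * e * user[f]
            p_u[f] += eta * e * q_old
            p_im[f] += eta * e * norm * q_old      # accumulated change of p_im
    if norm > 0:
        for j, a in alpha.items():                 # distribute the change to d
            for f in range(dim):
                d[j][f] += a / norm * (p_im[f] - p_old[f])


alpha = {0: 0.5, 1: -0.5, 2: 0.5}
samples = [(0, 4.0), (1, 2.0), (0, 3.0)]
init = {"p_u": [0.1, 0.2], "q": [[0.3, 0.1], [0.2, 0.4]],
        "d": [[0.1, 0.0], [0.0, 0.1], [0.2, 0.2]]}
a, b = copy.deepcopy(init), copy.deepcopy(init)
train_user_naive(a["p_u"], a["q"], a["d"], alpha, samples, eta=0.01)
train_user_grouped(b["p_u"], b["q"], b["d"], alpha, samples, eta=0.01)
max_diff = max(abs(x - y) for da, db in zip(a["d"], b["d"])
               for x, y in zip(da, db))
```

Without regularization the two procedures agree up to floating-point rounding, while the grouped version touches the feedback factors only once per user instead of once per sample.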



Algorithm 1 Efficient Training for Implicit and Explicit Feedback
for all user $u$ do
    $p^{im} \leftarrow \sum_j \alpha_j d_j$ {calculating implicit feedback}
    $p^{old} \leftarrow p^{im}$
    for all training samples of user $u$ do
        update other parameters, using $p^{im}$ to replace $\sum_j \alpha_j d_j$
        update $p^{im}$ directly, do not update $d_j$
    end for
    for all $i$, $\alpha_i \ne 0$ do
        $d_i \leftarrow d_i + \frac{\alpha_i}{\sum_k \alpha_k^2} (p^{im} - p^{old})$ {add all the changes back to $d$}
    end for
end for

5 How large-scale data is handled

Recommender systems confront large-scale data in practice, and
handling it is a must when dealing with real problems. For example,
the Yahoo! Music Dataset [2] consists of more than 200M ratings. A
toolkit that is robust to input data size is desirable for real
applications.

5.1 Input data buffering 

The input training data is extremely large in real applications, so
we don't try to load all the training data into memory. Instead, we
buffer all the training data in binary format on the hard disk. We
use stochastic gradient descent to train our model, so we only need
to iterate linearly over the data if we shuffle the data before
buffering.

Therefore, our solution requires the input features to be shuffled in
advance; a buffering program then creates a binary buffer from the
input features. The training procedure reads the data from the hard
disk and uses stochastic gradient descent to train the model. This
buffering approach makes the memory cost invariant to the input data
size, and allows us to train models over large-scale input data so
long as the parameters fit into memory.
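The buffering idea can be illustrated with a fixed-size binary record and a chunked sequential reader. The record layout (two 32-bit integers and a 32-bit float), the file name and the function names are assumptions of this sketch; the toolkit's real buffer format stores sparse feature vectors and is not reproduced here.

```python
import os
import struct
import tempfile

RECORD = struct.Struct("<iif")    # hypothetical record: user, item, rating


def write_buffer(path, samples):
    """Write already shuffled (user, item, rating) samples as binary records."""
    with open(path, "wb") as f:
        for u, i, r in samples:
            f.write(RECORD.pack(u, i, r))


def read_buffer(path, chunk_records=4096):
    """Yield samples in file order, holding one chunk in memory at a time."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(RECORD.size * chunk_records)
            if not chunk:
                break
            yield from RECORD.iter_unpack(chunk)


samples = [(0, 2, 4.0), (1, 0, 2.5), (0, 1, 3.0)]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.buffer")
    write_buffer(path, samples)
    restored = list(read_buffer(path, chunk_records=2))
```

Each training pass simply iterates over `read_buffer(path)` again, so memory use depends on the chunk size and the model parameters, not on the number of ratings.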

5.2 Execution pipeline 

Although input data buffering solves the problem of large-scale data,
it still suffers from the cost of reading the data from the hard
disk. To minimize the cost of I/O, we use a pre-fetching strategy. We
create an independent thread to fetch the buffered data into a memory
queue; the training program then reads the data from the memory queue
and does the training. The procedure is shown in Figure 2.

Figure 2: Execution pipeline

This pipeline style of execution removes the burden of I/O from the
training thread. So long as the I/O speed is similar to or faster
than the training speed, the cost of I/O is negligible, and our
experience on KDDCup'11 shows the success of this strategy. With
input buffering and pipeline execution, we can train a model with
test RMSE=22.16 for track 1 of KDDCup'11 [2] using less than 2G of
memory, without significantly increasing the training time.
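A pre-fetching pipeline of this kind can be sketched with a bounded queue and a background thread. This is a generic illustration with hypothetical names, not the toolkit's implementation; in Python the benefit is limited to I/O-bound readers, since the reader thread can wait on the disk while the trainer computes.

```python
import queue
import threading


def prefetch(source, max_batches=8):
    """Iterate over `source` while a background thread reads ahead."""
    buf = queue.Queue(maxsize=max_batches)   # bounded, so memory stays fixed
    done = object()                          # sentinel marking the end of data

    def worker():
        try:
            for item in source:
                buf.put(item)                # blocks while the queue is full
        finally:
            buf.put(done)                    # always release the consumer

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = buf.get()                     # blocks while the queue is empty
        if item is done:
            return
        yield item


batches = list(prefetch(iter(range(100)), max_batches=4))
```

The training loop consumes `prefetch(reader)` exactly as it would consume the reader itself. The queue bound caps how far the reader may run ahead, so the memory cost stays independent of the data size.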



6 Related work and discussion 

The work most closely related to feature-based matrix factorization
is the Factorization Machine [6]. The reader can refer to libFM for a
toolkit for factorization machines. Strictly speaking, our toolkit
implements a restricted case of the factorization machine, which is
more useful in some aspects. We support global features that do not
need to be taken into the factorization part, which is important for
bias features such as user day bias and neighborhood-based features.
The division of features also gives hints for model design. For
global features, we consider which aspects may influence the overall
rating. For user and item features, we consider how to better
describe the user preference and item property. Our model is also
related to [1] and [9]; the difference is that in feature-based
matrix factorization, the user/item features can be associated with
temporal information and other context information to better describe
the preference or property in the current context. Our current model
also has shortcomings. It doesn't support multiple distinct
factorizations at present. For example, sometimes we may want to
introduce a user vs. time tensor factorization together with the user
vs. item factorization. We will try our best to overcome these
drawbacks in future work.